Data Science project by Elina Yancheva
DataFrame Columns: - date - media_id - media_type - product_id - pan - tap_lat - tap_lon - is_network_og - transport_type - og_line_id - origin_ts - origin_stop_id - origin_stop_lat - origin_stop_lon - task_id - boarding_dist - boarding_t - is_transfer - transfer_t - transfer_dist - dest_ts - dest_stop_id - dest_stop_lat - dest_stop_lon - id
Dataset shape: (55978, 25) Invalid data counts by specific validation rules: transfer_t: 236 invalid values (0.42%) Total cells in dataset: 1399450 Total invalid cells: 236 Percentage of invalid data: 0.02% Rows with origin timestamp >= destination timestamp: 0 (0.00%) Duplicate IDs: 23888 (42.67%)
Check for negative values in numeric columns where negative values would be problematic
Negative Value Checks: Extreme negative transfer times (< -10 min): 24 records Invalid transfer times by transport type: transport_type metro 138 bus 70 trolleybus 22 tram 12 Name: count, dtype: int64 Transfers with zero transfer time: 6 records
Number of unique IDs that appear multiple times: 13229 Total number of rows with duplicate IDs: 23885 Analyzing duplicate ID rows: Out of 13229 IDs with duplicates: - 0 IDs have completely identical rows - 13229 IDs have differences between rows The dataset contains a single date: 2024-03-01
All duplicate IDs (13,230 unique IDs appearing 23,888 times) represent different transport journeys rather than identical data entries. Since these represent legitimate separate journeys might have been made by the same user/card, we will retain all rows in our analysis without filtering for unique IDs.
Total rows with duplicate IDs: 37114 Percentage of duplicate ID rows where is_transfer = True: 46.61% Percentage of duplicate ID rows where is_transfer = False: 53.39% Analyzing transfer patterns by journey sequence: Most common transfer patterns (T=transfer, N=non-transfer): TN: 4500 IDs (34.02%) NN: 1641 IDs (12.40%) TNT: 965 IDs (7.29%) NT: 825 IDs (6.24%) TNN: 564 IDs (4.26%) TT: 521 IDs (3.94%) NNN: 514 IDs (3.89%) NTN: 369 IDs (2.79%) TNTN: 358 IDs (2.71%) TTN: 308 IDs (2.33%)
Check if all duplicate ids involve transfers, to check if a single journey with many segments is represented by multiple rows or if they are different journeys with the same ID.
Analysis of the 37,118 journeys with duplicate IDs reveals that 46.61% involve transfers, with the most common pattern (34.01%) being a transfer followed by a non-transfer journey (TN). This suggests that many users make connected trips where they transfer once and then complete their journey directly, while the presence of various multi-segment patterns (TNT, TNN, TNTN) indicates more complex travel behaviors spanning multiple transit modes. The following analysis assumes that all duplicate IDs represent different journeys, by the same person.
media_id: 23.64% missing product_id: 79.65% missing pan: 76.36% missing og_line_id: 67.44% missing task_id: 67.44% missing boarding_dist: 67.44% missing
transport_type metro 37751 Name: count, dtype: int64 --- All metro records don't have og_line_id: True
Metro has the most missing data across all fields, with complete absence of product_id, og_line_id, task_id, and boarding_dist values. This suggests metro journeys are tracked differently in the system.
Value counts for dest_datetime (date only): dest_datetime 2025-03-01 55973 2025-03-02 2 Name: count, dtype: int64 Value counts for date: date 2024-03-01 55975 Name: count, dtype: int64
While all records show the journey date as 2024-03-01, the destination timestamps overwhelmingly resolve to 2025-03-01 with two outliers on 2025-03-02 (maybe short after midnight). This one-year difference between date (journey dates) and destination timestamps indicates a systematic timestamp error.
Most Frequent Line IDs by Transport Type
Summary of the day Bus: - Most frequent line: 280 with 676 trips (7.4% of all bus trips) - Total number of unique lines: 78 - Total trips recorded: 9133 Tram: - Most frequent line: 5 with 761 trips (13.7% of all tram trips) - Total number of unique lines: 16 - Total trips recorded: 5542 Trolleybus: - Most frequent line: 73 with 602 trips (17.0% of all trolleybus trips) - Total number of unique lines: 16 - Total trips recorded: 3549
Hourly Usage Patterns:
Average transfer waiting time: 7.45 minutes
Average transfer distance: 0.06 kilometers
These patterns suggest that while metro is the dominant transport mode, bus connections play a crucial role in the overall network connectivity. The presence of negative transfer times and the unusual metro distance distribution suggest further investigation into data quality and how transfers are recorded in the system.
Records by media type:
media_type desfire 41499 EMV 13231 ulc 1245 Name: count, dtype: int64
Distribution by Media Type and ID Type:
| With PAN | With Media ID | Total Records | |
|---|---|---|---|
| EMV | 13231 | 0 | 13231 |
| desfire | 0 | 41499 | 41499 |
| ulc | 0 | 1245 | 1245 |
Summary of user identifiers:
One-time: Users who traveled only once. Regular: Users who traveled 2-3 times. Frequent: Users who traveled 4-9 times. Very Frequent: Users who traveled more than 10 times. User segments based on journey count: One-time: 18861 users (58.8%) Regular: 10618 users (33.1%) Frequent: 2575 users (8.0%) Very Frequent: 36 users (0.1%) Commuter ratio: 2.00% of users are commuters
EMV card usage: 23.6% Other card types: 76.4%
The Saturday data reveals predominantly one-time travelers (58.8%) and regular users making 2-3 trips (33.1%), with significantly fewer frequent users (8.0%) and very few very frequent users (10+ trips) at just 0.1%.
This distribution is typical for weekend travel, with the low commuter ratio (2.34%) reflecting weekend-specific behavior. Most weekend passengers are making targeted leisure trips, shopping excursions, or social visits rather than the repetitive commuting patterns seen on weekdays.
Just 23.6% of travelers used EMV cards (typically occasional users paying standard fares), while the vast majority (76.4%) used specialized transit cards, suggesting most weekend travelers are still regular or subscription-holding passengers despite the reduced commuter pattern.
Applying clustering techniques to identify distinct patterns in public transport usage that might not be apparent through simple statistical analysis. Clustering helps discover natural groupings in the data based on multiple features simultaneously (trip duration, distance, speed, transport type, transfers, and time of day).
For example, without scaling, a 5-minute difference in trip duration (small relative to the ~24 minute range) would outweigh a change from no transfer to transfer (which is the maximum possible change in that feature).
After standardization, all features are expressed in the same unit: standard deviations from the mean. This makes them directly comparable and ensures no feature dominates the distance calculations in algorithms like K-means simply because it has larger values.
Cluster Centers:
| direct_distance_km | trip_duration_min | avg_speed_kmh | transport_type_code | is_transfer | hour_of_day | |
|---|---|---|---|---|---|---|
| 0 | 7.288520 | 22.691186 | 20.226426 | 0.902816 | 2.034468e-01 | 9.820177 |
| 1 | 2.647106 | 10.021272 | 16.666801 | 0.824825 | 2.986500e-14 | 10.672575 |
| 2 | 3.118071 | 11.107689 | 17.348273 | 0.801027 | 1.000000e+00 | 11.252833 |
| 3 | 2.014908 | 9.197925 | 13.991220 | 2.684802 | 3.572330e-01 | 10.990406 |
These "strange" values occur because cluster centers represent the average of all points in a cluster.
Cluster 0: Long-distance trips (7.3 km), longest duration (22.7 min), fastest speed (20.2 km/h), mix of transport types but mostly metro, 20% transfers, early morning (9.8 hour = 9:48 AM)
Cluster 1: Medium-distance trips (2.6 km), medium duration (10 min), medium speed (16.7 km/h), mostly metro, virtually no transfers, mid-morning (10.7 hour)
Cluster 2: Medium-distance trips (3.1 km), medium duration (11.1 min), medium-fast speed (17.3 km/h), mostly metro, 100% transfers, late morning (11.3 hour)
Cluster 3: Short-distance trips (2 km), shortest duration (9.2 min), slowest speed (14 km/h), predominantly trolley and tram, 36% transfers, mid-morning (11 hour)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) because:
Testing 30 parameter combinations on 5000 sample points... Top DBSCAN Parameter Combinations:
| eps | min_samples | n_clusters | n_noise | noise_ratio | silhouette | |
|---|---|---|---|---|---|---|
| 17 | 0.5 | 50 | 2 | 4878 | 0.9756 | 0.367025 |
| 28 | 1.0 | 100 | 2 | 2086 | 0.4172 | 0.274956 |
| 22 | 0.7 | 50 | 4 | 2608 | 0.5216 | 0.256140 |
| 10 | 0.3 | 10 | 24 | 4250 | 0.8500 | 0.170623 |
| 21 | 0.7 | 20 | 7 | 1151 | 0.2302 | 0.127688 |
| 26 | 1.0 | 20 | 8 | 375 | 0.0750 | 0.126463 |
| 27 | 1.0 | 50 | 7 | 945 | 0.1890 | 0.123067 |
| 20 | 0.7 | 10 | 10 | 597 | 0.1194 | 0.122569 |
| 16 | 0.5 | 20 | 6 | 2892 | 0.5784 | 0.121187 |
| 25 | 1.0 | 10 | 8 | 194 | 0.0388 | 0.119380 |
Applying optimal parameters to full dataset... Number of clusters in full dataset: 10 Noise points in full dataset: 1063 (1.90%) Cluster sizes:
dbscan_cluster -1 1063 0 3692 1 24943 2 2173 3 1154 4 4851 5 12672 6 2059 7 3340 8 28 Name: count, dtype: int64
Cluster statistics:
| direct_distance_km | trip_duration_min | avg_speed_kmh | is_transfer | hour_of_day | ||||
|---|---|---|---|---|---|---|---|---|
| mean | std | mean | std | mean | std | mean | mean | |
| dbscan_cluster | ||||||||
| -1 | 7.347782 | 4.404736 | 24.439284 | 15.892683 | 22.922333 | 10.512042 | 0.608655 | 11.669802 |
| 0 | 2.886126 | 2.160586 | 11.184499 | 7.178226 | 15.783792 | 5.290962 | 1.000000 | 11.146804 |
| 1 | 4.134472 | 2.557188 | 13.594033 | 6.863517 | 18.189080 | 6.407091 | 0.000000 | 10.417151 |
| 2 | 2.504481 | 1.612838 | 11.276460 | 7.432770 | 14.115128 | 4.351840 | 0.000000 | 10.803037 |
| 3 | 2.149410 | 1.382308 | 9.197708 | 5.843923 | 14.585703 | 4.155787 | 1.000000 | 11.487868 |
| 4 | 3.140819 | 2.313891 | 12.954868 | 9.530624 | 15.420983 | 5.587265 | 0.000000 | 10.179551 |
| 5 | 3.868322 | 2.132192 | 12.968212 | 5.935597 | 18.039500 | 6.473356 | 1.000000 | 11.038668 |
| 6 | 1.932517 | 1.297943 | 8.611904 | 5.866159 | 14.361062 | 5.226110 | 1.000000 | 11.338514 |
| 7 | 2.230385 | 1.494415 | 10.111399 | 7.020781 | 14.122665 | 5.002640 | 0.000000 | 10.597006 |
| 8 | 8.551643 | 0.466971 | 45.642064 | 2.667642 | 11.287417 | 1.000630 | 1.000000 | 12.571429 |
This analysis of Sofia's public transport system reveals distinct usage patterns characterized by four primary journey profiles, with metro being the dominant transit mode and showing clear morning and mid-day peak periods, due to the weekend. The clustering methods effectively identified trip segments ranging from long-distance, high-speed metro journeys to shorter, slower surface transit trips, providing valuable insights for future service planning and optimization.